{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 正则表达式和re模块\n",
    "\n",
    "\n",
    "## 正则表达式\n",
    "\n",
    "[正则表达式](https://baike.baidu.com/item/%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F)是用来匹配字符串或者子串的一种模式，匹配的字符串可以很具体，也可以很一般化。\n",
    "\n",
    "Python 标准库提供了 re 模块。\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## re.match & re.search\n",
    "\n",
    "在 `re` 模块中， `re.match` 和 `re.search` 是常用的两个方法：\n",
    "\n",
    "```\n",
    "re.match(pattern, string[, flags])\n",
    "re.search(pattern, string[, flags])\n",
    "```\n",
    "\n",
    "两者都寻找第一个匹配成功的部分，成功则返回一个 `match` 对象，不成功则返回 `None`，不同之处在于 `re.match` 只匹配字符串的开头部分，而 `re.search` 匹配的则是整个字符串中的子串。\n",
    "\n",
    "## re.findall & re.finditer\n",
    "\n",
    "`re.findall(pattern, string)` 返回所有匹配的对象， `re.finditer` 则返回一个迭代器。\n",
    "\n",
    "## re.split\n",
    "\n",
    "`re.split(pattern, string[, maxsplit])` 按照 `pattern` 指定的内容对字符串进行分割。\n",
    "\n",
    "## re.sub\n",
    "\n",
    "`re.sub(pattern, repl, string[, count])` 将 `pattern` 匹配的内容进行替换。\n",
    "\n",
    "## re.compile\n",
    "\n",
    "`re.compile(pattern)` 生成一个 `pattern` 对象，这个对象有匹配，替换，分割字符串的方法。\n",
    "\n",
    "## 正则表达式规则\n",
    "\n",
    "正则表达式由一些普通字符和一些元字符（metacharacters）组成。普通字符包括大小写的字母和数字，而元字符则具有特殊的含义：\n",
    "\n",
    "| 子表达式   | 匹配内容                                           |\n",
    "| :--------- | :------------------------------------------------- |\n",
    "| `.`        | 匹配除了换行符之外的内容                           |\n",
    "| `\\w`       | 匹配所有字母和数字字符                             |\n",
    "| `\\d`       | 匹配所有数字，相当于 `[0-9]`                       |\n",
    "| `\\s`       | 匹配空白，相当于 `[\\t\\n\\t\\f\\v]`                    |\n",
    "| `\\W,\\D,\\S` | 匹配对应小写字母形式的补                           |\n",
    "| `[...]`    | 表示可以匹配的集合，支持范围表示如 `a-z`, `0-9` 等 |\n",
    "| `(...)`    | 表示作为一个整体进行匹配                           |\n",
    "| ¦          | 表示逻辑或                                         |\n",
    "| `^`        | 表示匹配后面的子表达式的补                         |\n",
    "| `*`        | 表示匹配前面的子表达式 0 次或更多次                |\n",
    "| `+`        | 表示匹配前面的子表达式 1 次或更多次                |\n",
    "| `?`        | 表示匹配前面的子表达式 0 次或 1 次                 |\n",
    "| `{m}`      | 表示匹配前面的子表达式 m 次                        |\n",
    "| `{m,}`     | 表示匹配前面的子表达式至少 m 次                    |\n",
    "| `{m,n}`    | 表示匹配前面的子表达式至少 m 次，至多 n 次         |\n",
    "\n",
    "例如：\n",
    "\n",
    "- `ca*t       匹配： ct, cat, caaaat, ...`\n",
    "- `ab\\d|ac\\d  匹配： ab1, ac9, ...`\n",
    "- `([^a-q]bd) 匹配： rbd, 5bd, ...`\n",
    "\n",
    "## 例子\n",
    "\n",
    "假设我们要匹配这样的字符串："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<re.Match object; span=(0, 11), match='hello world'>\n"
     ]
    }
   ],
   "source": [
    "string = 'hello world'\n",
    "pattern = 'hello (\\w+)'\n",
    "\n",
    "match = re.match(pattern, string)\n",
    "print(match)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "一旦找到了符合条件的部分，我们便可以使用 group 方法查看匹配的部分："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hello world\n"
     ]
    }
   ],
   "source": [
    "if match is not None:\n",
    "    print (match.group(0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "world\n"
     ]
    }
   ],
   "source": [
    "if match is not None:\n",
    "    print (match.group(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以改变 string 的内容："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hello there\n",
      "there\n"
     ]
    }
   ],
   "source": [
    "string = 'hello there'\n",
    "pattern = 'hello (\\w+)'\n",
    "\n",
    "match = re.match(pattern, string)\n",
    "if match is not None:\n",
    "    print (match.group(0))\n",
    "    print (match.group(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通常，match.group(0) 匹配整个返回的内容，之后的 1,2,3,... 返回规则中每个括号（按照括号的位置排序）匹配的部分。\n",
    "\n",
    "如果某个 pattern 需要反复使用，那么我们可以将它预先编译："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "there\n"
     ]
    }
   ],
   "source": [
    "pattern1 = re.compile('hello (\\w+)')\n",
    "\n",
    "match = pattern1.match(string)\n",
    "if match is not None:\n",
    "    print (match.group(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "由于元字符的存在，所以对于一些特殊字符，我们需要使用 '\\' 进行逃逸字符的处理，使用表达式 '\\\\' 来匹配 '\\' 。\n",
    "\n",
    "但事实上，Python 本身对逃逸字符也是这样处理的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\\\n"
     ]
    }
   ],
   "source": [
    "pattern = '\\\\'\n",
    "print(pattern)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为逃逸字符的问题，我们需要使用四个 '\\\\\\\\' 来匹配一个单独的 '\\'："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['C:', 'foo', 'bar', 'baz.txt']\n"
     ]
    }
   ],
   "source": [
    "pattern = '\\\\\\\\'\n",
    "path = \"C:\\\\foo\\\\bar\\\\baz.txt\"\n",
    "print (re.split(pattern, path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "如果规则太多复杂，正则表达式不一定是个好选择。\n",
    "\n",
    "\n",
    "**Numpy 的 fromregex()**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing test.dat\n"
     ]
    }
   ],
   "source": [
    "%%file test.dat \n",
    "1312 foo\n",
    "1534    bar\n",
    "444  qux"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "fromregex(file, pattern, dtype)\n",
    "\n",
    "dtype 中的内容与 pattern 的括号一一对应："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(1312, b'foo') (1534, b'bar') ( 444, b'qux')]\n"
     ]
    }
   ],
   "source": [
    "\n",
    "pattern = \"(\\d+)\\s+(...)\"\n",
    "dt = [('num', 'int64'), ('key', 'S3')]\n",
    "\n",
    "from numpy import fromregex\n",
    "output = fromregex('test.dat', pattern, dt)\n",
    "print (output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "显示 num 项："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1312 1534  444]\n"
     ]
    }
   ],
   "source": [
    "print (output['num'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import os\n",
    "os.remove('test.dat')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
